
Conversation

@kuba-moo
Contributor

Reusable PR for hooking netdev CI to BPF testing.

@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 3 times, most recently from 4f22ee0 to 8a9a8e0 Compare March 28, 2024 04:46
@kuba-moo kuba-moo force-pushed the to-test branch 11 times, most recently from 64c403f to 8da1f58 Compare March 29, 2024 00:01
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 3 times, most recently from 78ebb17 to 9325308 Compare March 29, 2024 02:14
@kuba-moo kuba-moo force-pushed the to-test branch 6 times, most recently from c8c7b2f to a71aae6 Compare March 29, 2024 18:01
@kuba-moo kuba-moo force-pushed the to-test branch 2 times, most recently from d8feb00 to b16a6b9 Compare March 30, 2024 00:01
@kuba-moo kuba-moo force-pushed the to-test branch 2 times, most recently from 4164329 to c5cecb3 Compare March 30, 2024 06:00
edumazet and others added 29 commits November 10, 2025 04:00
Qdiscs use shinfo->gso_segs for their packet stats in bstats_update(),
but this field needs to be initialized for SKB_GSO_DODGY users.
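A minimal sketch of the idea, under assumptions: the helper name is
hypothetical and the segment math ignores header length for brevity.

	/* Hypothetical helper: make sure gso_segs is set before the
	 * qdisc stats consume it; SKB_GSO_DODGY producers may leave
	 * it at zero.  Header length is ignored for brevity.
	 */
	static void skb_gso_segs_sanitize(struct sk_buff *skb)
	{
		struct skb_shared_info *shinfo = skb_shinfo(skb);

		if (shinfo->gso_size && !shinfo->gso_segs)
			shinfo->gso_segs = DIV_ROUND_UP(skb->len,
							shinfo->gso_size);
	}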

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
…it()

qdisc_pkt_len_init() is currently initializing qdisc_skb_cb(skb)->pkt_len.

Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function
to qdisc_pkt_len_segs_init().
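A hedged sketch of the resulting helper; pkt_segs is the cb field this
series adds, and the exact computation shown is an assumption:

	static void qdisc_pkt_len_segs_init(struct sk_buff *skb)
	{
		const struct skb_shared_info *shinfo = skb_shinfo(skb);

		qdisc_skb_cb(skb)->pkt_len  = skb->len;
		/* Assumed: cache the segment count for later qdisc use */
		qdisc_skb_cb(skb)->pkt_segs = shinfo->gso_size ?
					      shinfo->gso_segs : 1;
	}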

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Avoid up to two cache line misses in qdisc dequeue() to fetch
skb_shinfo(skb)->gso_segs/gso_size while the qdisc spinlock is held.

This gives a 5% improvement in a TX-intensive workload.
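An assumed accessor shape consistent with the changelog (the real helper
may differ):

	/* Return the segment count cached by qdisc_pkt_len_segs_init()
	 * instead of dereferencing skb_shinfo() under the qdisc spinlock.
	 */
	static inline unsigned int qdisc_pkt_segs(const struct sk_buff *skb)
	{
		return qdisc_skb_cb(skb)->pkt_segs;
	}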

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Use the new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue()
for non-GSO packets.

cake_overhead() does not have to recompute it.
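Roughly, in cake_enqueue() (a sketch, not the actual diff):

	/* Before: skb_shinfo(skb)->gso_segs, a possible cache miss */
	u32 segs = qdisc_pkt_segs(skb);	/* 1 for non-GSO packets */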

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
It is possible to reorganize struct Qdisc to avoid always dirtying two
cache lines in the fast path, reducing this to a single dirtied cache line.

In the current layout, we change only four to six fields in the first cache line:
 - q.spinlock
 - q.qlen
 - bstats.bytes
 - bstats.packets
 - some qdiscs also change q.next/q.prev

In the second cache line we change in the fast path:
 - running
 - state
 - qstats.backlog

        /* --- cacheline 2 boundary (128 bytes) --- */
        struct sk_buff_head        gso_skb __attribute__((__aligned__(64))); /*  0x80  0x18 */
        struct qdisc_skb_head      q;                    /*  0x98  0x18 */
        struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xb0  0x10 */

        /* --- cacheline 3 boundary (192 bytes) --- */
        struct gnet_stats_queue    qstats;               /*  0xc0  0x14 */
        bool                       running;              /*  0xd4   0x1 */

        /* XXX 3 bytes hole, try to pack */

        unsigned long              state;                /*  0xd8   0x8 */
        struct Qdisc *             next_sched;           /*  0xe0   0x8 */
        struct sk_buff_head        skb_bad_txq;          /*  0xe8  0x18 */
        /* --- cacheline 4 boundary (256 bytes) --- */

Reorganize things so the first cache line is mostly read,
followed by a mostly written one.

This gives a ~3% performance increase under TX stress.

Note that there is an additional hole because qstats now spans into a third cache line.

	/* --- cacheline 2 boundary (128 bytes) --- */
	__u8                       __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /*  0x80     0 */
	struct sk_buff_head        gso_skb;              /*  0x80  0x18 */
	struct Qdisc *             next_sched;           /*  0x98   0x8 */
	struct sk_buff_head        skb_bad_txq;          /*  0xa0  0x18 */
	__u8                       __cacheline_group_end__Qdisc_read_mostly[0]; /*  0xb8     0 */

	/* XXX 8 bytes hole, try to pack */

	/* --- cacheline 3 boundary (192 bytes) --- */
	__u8                       __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /*  0xc0     0 */
	struct qdisc_skb_head      q;                    /*  0xc0  0x18 */
	unsigned long              state;                /*  0xd8   0x8 */
	struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xe0  0x10 */
	bool                       running;              /*  0xf0   0x1 */

	/* XXX 3 bytes hole, try to pack */

	struct gnet_stats_queue    qstats;               /*  0xf4  0x14 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	__u8                       __cacheline_group_end__Qdisc_write[0]; /* 0x108     0 */

	/* XXX 56 bytes hole, try to pack */
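The layout above corresponds to source annotations along these lines, an
illustrative reconstruction using the kernel's cacheline group helpers
rather than the exact diff:

	struct Qdisc {
		/* ...earlier fields unchanged... */
		__cacheline_group_begin_aligned(Qdisc_read_mostly);
		struct sk_buff_head	gso_skb;
		struct Qdisc		*next_sched;
		struct sk_buff_head	skb_bad_txq;
		__cacheline_group_end_aligned(Qdisc_read_mostly);

		__cacheline_group_begin_aligned(Qdisc_write);
		struct qdisc_skb_head	q;
		unsigned long		state;
		struct gnet_stats_basic_sync bstats;
		bool			running;
		struct gnet_stats_queue	qstats;
		__cacheline_group_end_aligned(Qdisc_write);
		/* ...remaining fields... */
	};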

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Group together changes to qdisc fields to reduce chances of false sharing
if another cpu attempts to acquire the qdisc spinlock.

  qdisc_qstats_backlog_dec(sch, skb);
  sch->q.qlen--;
  qdisc_bstats_update(sch, skb);

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Prefetch the skb that we are likely to dequeue at the next dequeue().

Also call fq_dequeue_skb() a bit sooner in fq_dequeue().

This reduces the window between the read of q.qlen and
changes to fields in the cache line that could be dirtied
by another cpu trying to queue a packet.
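A hedged sketch of the dequeue-side shape (fq_peek()/fq_dequeue_skb() are
sch_fq internals; the prefetch target is an assumption):

	skb = fq_peek(f);
	if (skb) {
		/* skb->next is the packet most likely returned by the
		 * next fq_dequeue() call.
		 */
		prefetch(skb->next);
		fq_dequeue_skb(sch, f, skb);
	}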

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Most qdiscs need to read skb->priority at enqueue() time, in
__dev_xmit_skb().

In commit 100dfa7 ("net: dev_queue_xmit() llist adoption")
I added a prefetch(next); let's add another one for the second
half of skb.

Note that skb->priority and skb->hash share a common cache line,
so this patch helps qdiscs needing both fields.
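Illustrative shape in __dev_xmit_skb() (exact placement assumed):

	prefetch(next);			/* from commit 100dfa7 */
	/* Pull in the cache line holding skb->priority and skb->hash
	 * before the qdisc enqueue needs them.
	 */
	prefetch(&skb->priority);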

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
q->limit is read locklessly, add a READ_ONCE().
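The shape of the fix, roughly; the surrounding context is assumed, the
READ_ONCE() annotation is the point:

	/* Lockless fast path: q->limit may change concurrently under
	 * the qdisc lock, so annotate the racy read.
	 */
	if (unlikely(qdisc_qlen(q) >= READ_ONCE(q->limit)))
		return qdisc_drop(skb, q, to_free);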

Fixes: 100dfa7 ("net: dev_queue_xmit() llist adoption")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be
detected and a device-controlled port reset for all queues can be
scheduled on an ordered workqueue. Resetting all queues on stall
detection is recommended by the hardware team.

The change introduces a single ordered workqueue
("mana_per_port_queue_reset_wq") with WQ_UNBOUND | WQ_MEM_RECLAIM and
queues exactly one work_struct per port onto it.
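A hedged sketch consistent with the description; apart from the
workqueue name, the struct members are assumptions:

	/* Assumed setup (alloc_ordered_workqueue() implies WQ_UNBOUND):
	 *   apc->queue_reset_wq =
	 *	alloc_ordered_workqueue("mana_per_port_queue_reset_wq",
	 *				WQ_MEM_RECLAIM);
	 */
	static void mana_tx_timeout(struct net_device *ndev, unsigned int txqueue)
	{
		struct mana_port_context *apc = netdev_priv(ndev);

		/* One work_struct per port; the ordered workqueue
		 * serializes the device-controlled port resets.
		 */
		queue_work(apc->queue_reset_wq, &apc->queue_reset_work);
	}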

Reviewed-by: Pavan Chebbi <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Signed-off-by: Dipayaan Roy <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
The AST2600 contains two dies, each with its own MAC, and these MACs
require different delay configurations.
Previously, these delay values were configured during the bootloader
stage rather than in the driver. This change introduces the use of the
standard properties defined in ethernet-controller.yaml to configure
the delay values directly in the driver.

Add the new property, "aspeed,rgmii-delay-ps", to specify the per-step
RGMII delay of the different MACs. On the Aspeed platform the RGMII
delay chain has 32 steps, so the total delay is
"aspeed,rgmii-delay-ps" * 32.
Default delay values are declared so that tx-internal-delay-ps and
rx-internal-delay-ps become optional. If these properties are not present,
the driver will use the default values instead.
Add conditional schema constraints for Aspeed AST2600 MAC controllers:
- For MAC0/1, the aspeed,rgmii-delay-ps property is 45 ps
- For MAC2/3, the aspeed,rgmii-delay-ps property is 250 ps
- Both require the "aspeed,scu" and "aspeed,rgmii-delay-ps" properties.
Other compatible values remain unrestricted.

Signed-off-by: Jacky Chou <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
The RGMII delay is configured in the SCU region on the Aspeed AST2600;
therefore, add the aspeed,scu property in the dtsi for the RGMII delay.
The RGMII delay value differs for each MAC, as listed below:
MAC0 and MAC1 -> 45 ps
MAC2 and MAC3 -> 250 ps
Add the "aspeed,rgmii-delay-ps" property to each MAC to specify the
corresponding delay value.

Signed-off-by: Jacky Chou <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
This change sets the rx-internal-delay-ps and tx-internal-delay-ps
properties to control the RGMII signal delay.
The phy-mode for MAC0–MAC3 is updated to "rgmii-id" to enable TX/RX
internal delay on the PHY and disable the corresponding delay
on the MAC.

Signed-off-by: Jacky Chou <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
On the AST2600 platform, the RGMII delay is controlled via the
SCU registers. The delay chain configuration differs between MAC0/1
and MAC2/3, even though all four MACs use a 32-stage delay chain.
+------+----------+-----------+-------------+-------------+
|      |Delay Unit|Delay Stage|TX Edge Stage|RX Edge Stage|
+------+----------+-----------+-------------+-------------+
|MAC0/1|     45 ps|        32 |           0 |           0 |
+------+----------+-----------+-------------+-------------+
|MAC2/3|    250 ps|        32 |           0 |          26 |
+------+----------+-----------+-------------+-------------+
For MAC2/3, the "no delay" condition starts from stage 26.
Setting the RX delay stage to 26 means that no additional RX
delay is applied.
The RX delay settings of MAC2/3 are listed below:
26 -> 0   ns, 27 -> 0.25 ns, ... , 31 -> 1.25 ns,
0  -> 1.5 ns, 1  -> 1.75 ns, ... , 25 -> 7.75 ns

Therefore, we calculate the delay stage from the
rx-internal-delay-ps of MAC2/3 and add 26. If the stage is equal
to or bigger than 32, it is masked with 0x1f to get
the correct setting.
The delay chain acts like a ring for configuration.
For example, if the rx-internal-delay-ps of MAC2/3 is 2000 ps,
the resulting delay stage is 2.
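The stage math, as a sketch (names are illustrative; the constants come
from the table above):

	#define AST2600_MAC23_DELAY_PS	250	/* delay unit */
	#define AST2600_MAC23_RX_BASE	26	/* "no delay" stage */
	#define AST2600_STAGE_MASK	0x1f	/* 32-stage ring */

	static u32 ast2600_mac23_rx_stage(u32 rx_delay_ps)
	{
		u32 steps = rx_delay_ps / AST2600_MAC23_DELAY_PS;

		/* e.g. 2000 ps: 2000 / 250 = 8, (26 + 8) & 0x1f = 2 */
		return (AST2600_MAC23_RX_BASE + steps) & AST2600_STAGE_MASK;
	}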

Starting with this patch, the driver will remind legacy dts users to
update "phy-mode" to "rgmii-id" and add the corresponding RGMII delay
with "rx-internal-delay-ps" and "tx-internal-delay-ps".
If these properties are missing, the driver will configure the default
RGMII delay, meaning the TX and RX delay on the MAC side will be
disabled.

Signed-off-by: Jacky Chou <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Alex will send phylink patches soon which will make us link up
on QEMU again, but for now let's hack up the link. This gives us
a chance to add another QEMU NIC test to the "HW" runners in the CI.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Let's see if this increases the stability of timing-related results.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
These are unlikely to matter for CI testing and they slow things down.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
tc_actions.sh keeps hanging the forwarding tests.

sdf@: tdc & tdc-dbg started intermittently failing around Sep 25th

Signed-off-by: NipaLocal <nipa@local>
Signed-off-by: NipaLocal <nipa@local>
We exclusively use headless VMs today; don't waste time
compiling sound and GPU drivers.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kmemleak auto scan could be a source of latency for the tests.
We run a full scan manually after the tests, so we don't need
the autoscan thread to be enabled.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: NipaLocal <nipa@local>